Skip to content

feat: re-land graceful scale-down for AlluxioRuntime with e2e test#6061

Open
jakharmonika364 wants to merge 6 commits into
fluid-cloudnative:masterfrom
jakharmonika364:feat-alluxio-scaledown-e2e
Open

feat: re-land graceful scale-down for AlluxioRuntime with e2e test#6061
jakharmonika364 wants to merge 6 commits into
fluid-cloudnative:masterfrom
jakharmonika364:feat-alluxio-scaledown-e2e

Conversation

@jakharmonika364

Copy link
Copy Markdown
Contributor

Re-lands #5805, reverted in #6059 because it had no e2e coverage of the actual scale-down -> drain -> decommission -> data-integrity path.

This PR adds that e2e test under test/gha-e2e/alluxio-scaledown/: binds a 2-replica AlluxioRuntime, reads data, scales down to 1 replica, waits for the worker StatefulSet to actually converge, then reads again.

While wiring this up I found GracefulWorkerScaleDown was registered as a feature gate but never exposed as a flag on the alluxioruntime-controller binary (cmd/alluxio/app/alluxio.go), unlike csi and fluidapp, so it was unreachable in any real deployment. Fixed that too - the e2e test enables it via a deployment patch since there's no Helm value for it yet.

Original feature code is unchanged from #5805.

@fluid-e2e-bot

fluid-e2e-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangyuliufeng for approval by writing /assign @yangyuliufeng in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot

fluid-e2e-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

Hi @jakharmonika364. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new end-to-end test suite for testing the graceful scale-down of AlluxioRuntime, including the test script and associated Kubernetes resource manifests. It also exposes the feature gate flags in the Alluxio controller command. The review feedback highlights a critical missing blank import required to register the feature gates, and suggests improvements to the E2E test script, such as simplifying the wait-loop logging logic and properly scoping variables as local to prevent global namespace pollution.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread cmd/alluxio/app/alluxio.go
Comment thread test/gha-e2e/alluxio-scaledown/test.sh
Comment thread test/gha-e2e/alluxio-scaledown/test.sh
@jakharmonika364

Copy link
Copy Markdown
Contributor Author

Fixed the wait_dataset_bound and wait_job_completed issues. The blank import suggestion isn't needed though - pkg/ddc/alluxio already imports pkg/features directly (replicas.go), and cmd/alluxio/app/alluxio.go imports pkg/ddc/alluxio, so the init() already runs. Verified --feature-gates works via go run ./cmd/alluxio start --help.

@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.69048% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.89%. Comparing base (36f0467) to head (e8c3871).

Files with missing lines Patch % Lines
pkg/ddc/alluxio/replicas.go 87.87% 11 Missing and 5 partials ⚠️
pkg/features/features.go 0.00% 2 Missing ⚠️
cmd/alluxio/app/alluxio.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6061      +/-   ##
==========================================
+ Coverage   64.77%   64.89%   +0.11%     
==========================================
  Files         484      486       +2     
  Lines       33892    34058     +166     
==========================================
+ Hits        21954    22101     +147     
- Misses      10215    10229      +14     
- Partials     1723     1728       +5     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Re-lands the AlluxioRuntime graceful worker scale-down feature by adding a new Kind-based GitHub Actions e2e scenario and exposing the --feature-gates flag on the alluxioruntime-controller binary so the gate can be enabled in real deployments.

Changes:

  • Add a new test/gha-e2e/alluxio-scaledown/ e2e scenario that scales an AlluxioRuntime from 2→1 and verifies data access before/after.
  • Expose feature gates on alluxioruntime-controller via DefaultMutableFeatureGate.AddFlag(...).
  • Wire the new e2e scenario into .github/scripts/gha-e2e.sh so it runs in CI.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/gha-e2e/alluxio-scaledown/test.sh New e2e driver script that enables the gate, creates runtime, scales down, and re-reads data
test/gha-e2e/alluxio-scaledown/read_before_job.yaml Job used to validate data access before scale-down
test/gha-e2e/alluxio-scaledown/read_after_job.yaml Job used to validate data access after scale-down
test/gha-e2e/alluxio-scaledown/dataset.yaml Dataset + 2-replica AlluxioRuntime manifest for the scale-down scenario
cmd/alluxio/app/alluxio.go Adds --feature-gates flag exposure for the alluxio controller binary
.github/scripts/gha-e2e.sh Runs the new alluxio scale-down e2e scenario in the CI e2e sequence

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/gha-e2e/alluxio-scaledown/test.sh Outdated
Comment thread test/gha-e2e/alluxio-scaledown/test.sh Outdated
Comment thread test/gha-e2e/alluxio-scaledown/test.sh
Comment thread test/gha-e2e/alluxio-scaledown/read_after_job.yaml Outdated
Comment thread test/gha-e2e/alluxio-scaledown/dataset.yaml Outdated
Comment thread test/gha-e2e/alluxio-scaledown/test.sh
Comment thread test/gha-e2e/alluxio-scaledown/test.sh
Comment thread test/gha-e2e/alluxio-scaledown/test.sh Outdated
Comment thread test/gha-e2e/alluxio-scaledown/dataset.yaml Outdated
…dvancedStatefulSet (fluid-cloudnative#4193) (fluid-cloudnative#5805)" (fluid-cloudnative#6059)

This reverts commit 333e13a.

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Adds a real-cluster scenario exercising the scale-down -> drain ->
decommission -> data-integrity path: bind a 2-replica AlluxioRuntime,
read data, scale down to 1 replica, wait for the worker StatefulSet to
actually converge (proving the drain doesn't stall), then read again.

Also wires --feature-gates into the alluxioruntime-controller binary
(cmd/alluxio/app/alluxio.go), mirroring cmd/csi and cmd/fluidapp -
without it GracefulWorkerScaleDown was registered but unreachable on a
real deployment, so the e2e test enables it via a deployment patch.

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Simplify wait_dataset_bound's elapsed-time tracking to a single
counter (matches wait_worker_replicas), and declare succeed/job_failed
as local in wait_job_completed to avoid polluting global scope.

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
- check kubectl patch exit codes instead of letting failures surface
  as confusing downstream timeouts
- log the WorkerDecommissioning condition while polling so a stuck
  drain is debuggable from CI output
- replace the external apache.org mirror mount with an in-cluster
  MinIO bucket seeded with one known object, and assert its exact
  content before/after scale-down instead of just checking the
  directory is non-empty - the previous assertion would pass even if
  decommission silently dropped cached data instead of migrating it

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
@jakharmonika364 jakharmonika364 force-pushed the feat-alluxio-scaledown-e2e branch from c3028aa to af5935d Compare June 25, 2026 10:23
@jakharmonika364

Copy link
Copy Markdown
Contributor Author

Found why CI was failing: this branch forked from master before #6059's revert merged, so the PR's merge-ref was silently dropping the whole feature (pkg/features, replicas.go logic, etc.) since my commits never touched those files directly. Rebuilt the branch off current master with the revert reverted, then replayed the e2e/flag-fix commits on top - --feature-gates=GracefulWorkerScaleDown=true now actually registers.

Comment thread test/gha-e2e/alluxio-scaledown/minio_create_bucket.yaml Fixed
- mountPoint: s3://scaledown/
name: scaledown
options:
alluxio.underfs.s3.endpoint: "http://scaledown-minio:9000"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this as http - it's an in-cluster only MinIO with throwaway test creds, same as curvine's existing endpoint_url: http://minio:9000 fixture. No real traffic to protect here.

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
Alluxio mounts s3://scaledown/ at /scaledown inside its namespace
(named after the Dataset mount's name), not at the FUSE root. The
existing alluxio test never noticed this because its assertion just
checks "ls /data" is non-empty, which passes either way. Use the
correct /data/scaledown/fixture.txt path so the content check
actually exercises the fixture.

Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants